-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
named-types: 1st pass on klay_beam dataclass primitives #41 #68
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see where you're going with this design pattern, however personally i had something slightly different in mind. The base Datum
class is powerful in that it avoids duplication and creates a structured interface for passing around data. However, I don't think it really helps us achieve (what I thought to be) the main objective of using these data structures to validate types and values that are passed between transforms.
This would suggest a much less generic setup but instead Transform
-specific data types, i.e. TorchAudioTuple
, TorchFeatureTuple
, NumpyAudioTuple
with type and potentially even value validation. There's definitely room to use a base class in the way you suggested but I would want to see how this would trickle down into child classes that both provide an interface but also validate data being passed between transforms.
"""General data storage returned by transforms that load time-series numpy | ||
data such as LoadWithLibrosa""" | ||
|
||
datum: Tuple[np.ndarray, int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this conflicting with the name, NumpyDatum
is suggesting that this is a generic numpy
dataclass. However the fact that this is a Tuple
means you're expecting an audio array and a sample rate value here. If that's the case this should be an AudioDatum
. Alternatively I would suggest adding sample rate as a separate field, or I would just leave the datum
field and allow another child class called AudioNumpyDatum
.
This applies to the TorchDatum
also and is important because there's scenarios where we're not passing around numpy
arrays that are audio e.g. feature arrays.
Got it, thanks Max. If I understand correctly, you are proposing each Transform in Klay Beam could have its own Dataclass that has the exact properties relevant to that transform. For example @dataclass(frozen=True)
class TimeSeriesTorchData:
source_dir: str
source_path: str
sample_rate: int
tensor: torch.Tensor This is still somewhat resuable–we could write a One of the problems that I want to solve is that Transforms should be able to forward little bits of metadata from their input elements to subsequent transforms in the pipeline. One example is I would like Transforms (like If every Transform in the pipeline has rigid input and output types, I'm worried that:
Let's think together if there is a better way to accomplish these goals while also maintaining more type safety than the current proposal. Maybe the answer is to just start writing the updated transforms and pipelines, and see what approach goes smoother. |
After thinking about it some more, it seems like including #41 in v1.0 is worthwhile. Here's a proposal for the data types that could be passed between core
klay_beam
transforms.The next step it so try re-writing some of the transforms to use the classes, but before I do that it would be really helpful to have a quick review of the choices here, which are explained in the docstrings.